1 Special session on memory wall: A first glance at Kilo-instruction based
multiprocessors

Marco Galluzzi, Valentin Puente, Adrian Cristal, Ramon Beivide, Jose-Angel Gregorio, Mateo 
Valero 

April 2004 Proceedings of the first conference on computing frontiers on Computing 
frontiers 


The ever increasing gap between processor and memory speed, sometimes referred to as 
the Memory Wall problem [42], has a very negative impact on performance. This mismatch
will be more severe in future processor's generation. Modern cache organizations and 
prefetching techniques will not be able to solve this problem. A very novel and promising 
technique to deal with the Memory Wall consists on designing processors able to maintain 
thousands of in-flight instructions. An example of ... 



Keywords: CC-NUMA, Kilo-instruction processors, ROB, in-flight instructions, instruction 
window, memory wall, shared-memory multiprocessors 



2 IPStash: a Power-Efficient Memory Architecture for IP-lookup
Stefanos Kaxiras, Georgios Keramidas 

December 2003 Proceedings of the 36th Annual IEEE/ACM International Symposium on
Microarchitecture


High-speed routers often use commodity, fully-associative, TCAMs (Ternary Content
Addressable Memories) to perform packet classification and routing (IP-lookup). We propose
a memory architecture called IPStash to act as a TCAM replacement, offering at the same time,
better functionality, higher performance, and significant power savings. The premise of our
work is that full associativity is not necessary for IP-lookup. Rather, we show that the
required associativity is simply a function of the routing table s ...

3 Cache coherence in large-scale shared-memory multiprocessors: issues and 
comparisons 

David J. Lilja 

September 1993 ACM Computing Surveys (CSUR), volume 25 issue 3 




4 Minimization of memory traffic in high-level synthesis



David J. Kolson, Alexandru Nicolau, Nikil Dutt 

June 1994 Proceedings of the 31st annual conference on Design automation 




5 Learning not to share 
Jason Liu, David Nicol 

May 2001 Proceedings of the fifteenth workshop on Parallel and distributed 
simulation 


Strong reasons exist for executing a large-scale discrete-event simulation on a cluster of 
processor nodes (each of which may be a shared-memory multiprocessor or a uniprocessor).
This is the architecture of the largest scale parallel machines, and so the largest simulation 
problems can only be solved this way. It is a common architecture even in less esoteric 
settings, and is suitable for memory-bound simulations. This paper describes our approach 
to porting the SSF simulation kernel to t ... 



6 Architecture and implementation of a VLIW supercomputer 

Robert P. Colwell, W. Eric Hall, Chandra S. Joshi, David B. Papworth, Paul K. Rodman, James 
E. Tornes 

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 


Very-Long-Instruction-Word (VLIW) computers achieve high performance by exploiting the 
fine-grain parallelism present in sequential or vectorizable code. Multiflow's /200 and /300 
VLIW systems yielded near-supercomputer performance by this means despite the relatively
slow (65 nS) clocks. With its much faster clock period (15 nS) and architectural 
improvements, the new /500 system attains approximately 4-9X the performance of its 
predecessors. This paper describes the /500 architecture and implem ...



7 Fast detection of communication patterns in distributed executions
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies
on Collaborative research 


Understanding distributed applications is a tedious and difficult task. Visualizations based on
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide
the user with the desired overview of the application. In our experience, such tools display 
repeated occurrences of non-trivial commun ... 



8 The Flux OSKit: a substrate for kernel and language research

Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, Olin Shivers 

October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles, volume 31 issue 5



9 EPIC compilation: Optimizations to prevent cache penalties for the Intel® Itanium® 2
Processor

Jean-Francois Collard, Daniel Lavery 

March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization 




This paper describes scheduling optimizations in the Intel® Itanium® compiler to prevent 
cache penalties due to various micro-architectural effects on the Itanium 2 processor. This 
paper does not try to improve cache hit rates but to avoid penalties, which probably all 
processors have in one form or another, even in the case of cache hits. These optimizations
make use of sophisticated methods for disambiguation of memory references, and this 
paper examines the performance improvement obt ... 



10 Parallel processing: a smart compiler and a dumb machine
Joseph A. Fisher, John R. Ellis, John C. Ruttenberg, Alexandru Nicolau 

June 1984 ACM SIGPLAN Notices, Proceedings of the 1984 SIGPLAN symposium on
Compiler construction, volume 19 issue 6

Multiprocessors and vector machines, the only successful parallel architectures, have 
coarse-grained parallelism that is hard for compilers to take advantage of. We've developed
a new fine-grained parallel architecture and a compiler that together offer order-of- 
magnitude speedups for ordinary scientific code. 

11 Analytic evaluation of shared-memory systems with ILP processors
Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, David A. Wood 

April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th
annual international symposium on Computer architecture, volume 26 issue 3

This paper develops and validates an analytical model for evaluating various types of 
architectural alternatives for shared-memory systems with processors that aggressively 
exploit instruction-level parallelism. Compared to simulation, the analytical model is many 
orders of magnitude faster to solve, yielding highly accurate system performance estimates
in seconds. The model input parameters characterize the ability of an application to exploit
instruction-level parallelism as well as the interac ... 



12 Trajectory sampling for direct traffic observation
N. G. Duffield, Matthias Grossglauser 

June 2001 IEEE/ACM Transactions on Networking (TON), volume 9 issue 3


Traffic measurement is a critical component for the control and engineering of
communication networks. We argue that traffic measurement should make it possible to 
obtain the spatial flow of traffic through the domain, i.e., the paths followed by packets 
between any ingress and egress point of the domain. Most resource allocation and capacity 
planning tasks can benefit from such information. Also, traffic measurements should be 
obtained without a routing model and without knowledge of netw ... 

Keywords: Hash functions, Internet traffic measurement, packet sampling, traffic 
engineering 



13 A high-performance network intrusion detection system 
R. Sekar, Y. Guang, S. Verma, T. Shanbhag 

November 1999 Proceedings of the 6th ACM conference on Computer and 
communications security 


In this paper we present a new approach for network intrusion detection based on concise 
specifications that characterize normal and abnormal network packet sequences. Our 
specification language is geared for a robust network intrusion detection by enforcing a strict
type discipline via a combination of static and dynamic type checking. Unlike most previous
approaches in network intrusion detection, our approach can easily support new network 
protocols as information relating to the protoco ... 



14 Combining hardware and software cache coherence strategies 
David J. Lilja, Pen-Chung Yew 

June 1991 Proceedings of the 5th international conference on Supercomputing 




15 TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP 

Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh,
Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, Charles R. Moore

March 2004 ACM Transactions on Architecture and Code Optimization (TACO), volume 1
issue 1


This paper describes the polymorphous TRIPS architecture that can be configured for 
different granularities and types of parallelism. The TRIPS architecture is the first in a class 
of post-RISC, dataflow-like instruction sets called explicit data-graph execution (EDGE). The
EDGE ISA is coupled with hardware mechanisms that enable the processing cores and the 
on-chip memory system to be configured and combined in different modes for instruction, 
data, or thread-level parallelism. To adapt ... 

Keywords: Computer architecture, configurable computing, scalable and high-performance
computing 



16 The effect of real data cache behavior on the performance of a microarchitecture that 
supports dynamic scheduling 
Michael Butler, Yale Patt 

September 1991 Proceedings of the 24th annual international symposium on 
Microarchitecture




17 Trajectory sampling for direct traffic observation
N. G. Duffield, M. Grossglauser 

August 2000 ACM SIGCOMM Computer Communication Review, Proceedings of the
conference on Applications, Technologies, Architectures, and Protocols for
Computer Communication, volume 30 issue 4


Traffic measurement is a critical component for the control and engineering of 
communication networks. We argue that traffic measurement should make it possible to 
obtain the spatial flow of traffic through the domain, i.e., the paths followed by packets 
between any ingress and egress point of the domain. Most resource allocation and capacity 
planning tasks can benefit from such information. Also, traffic measurements should be 
obtained without a routing model and without knowledge of netwo ... 

18 DOD standard transmission control protocol
Jon Postel 

October 1980 ACM SIGCOMM Computer Communication Review, volume 10 issue 4 



19 New models and architectures: Spatial computation 

Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein 
October 2004 Proceedings of the 11th international conference on Architectural 
support for programming languages and operating systems 


This paper describes a computer architecture, Spatial Computation (SC), which is based on
the translation of high-level language programs directly into hardware structures. SC
program implementations are completely distributed, with no centralized control. SC circuits
are optimized for wires at the expense of computation units. In this paper we investigate a
particular implementation of SC: ASH (Application-Specific Hardware). Under the 
assumption that computation is cheaper than co ... 

Keywords: application-specific hardware, dataflow machine, low-power, spatial 
computation 

20 Informatics: semantic processing: How to deal with ambiguities while parsing: EXAM -
a semantic processing system for Japanese language
Hidetosi Sirai 

September 1980 Proceedings of the 8th conference on Computational linguistics 


It is difficult for a natural language understanding system (NLUS) to deal with ambiguities. 
There is a dilemma: an NLUS must be able to produce plausible interpretations for given 
sentences, avoiding the combinatorial explosion of possible interpretations. Furthermore, it 
is desirable for an NLUS to produce several interpretations if they are equally plausible. 
EXAM, the system described in this paper, is an experimental text understanding system 
designed to deal with ambiguities effectively an ... 